In this notebook we show how samples are positioned in N-dimensional space. Dividing this space into regions allows us to classify (predict the class of) novel, as-yet unseen samples.
In this example we are going to train classifiers on Fisher's Iris dataset and plot their decision surfaces.
First, some imports:
In [13]:
%matplotlib inline
import numpy as np
from matplotlib import pyplot as plt
from sklearn.datasets import load_iris
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
In [8]:
# Load
iris = load_iris()
# Get features and ground truth
X = iris.data
Y = iris.target
# Extract the sepal length and petal width features (columns 0 and 3)
X03 = X[:, [0, 3]]
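If we want to double-check that columns 0 and 3 really are sepal length and petal width, we can print the dataset's feature names (a quick optional sanity check):
In [ ]:
# The `feature_names` attribute lists the measurement stored in each column of X
print(iris.feature_names[0])
print(iris.feature_names[3])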
In [16]:
# Figure
plt.figure(figsize=(10,8))
# Scatter plot of the data points
plt.scatter(X[Y==0,0], X[Y==0,3], color='red')
plt.scatter(X[Y==1,0], X[Y==1,3], color='blue')
plt.scatter(X[Y==2,0], X[Y==2,3], color='green')
# Labels
plt.title('Fisher\'s iris dataset')
plt.xlabel('Sepal length (cm)')
plt.ylabel('Petal width (cm)')
# Show
plt.show()
If we were given measurements of a flower of unknown species, how could we predict its species from its measurements?
We would need to divide the space shown in the plot above into regions, within each of which a flower most likely belongs to a particular species.
The classifiers in scikit-learn do just this: they learn parameters that control how an algorithm divides the space into regions.
The function below plots the data points in the dataset along with the decision surface, showing the regions of space that the classifier assigns to each of the three classes.
In [18]:
def clf_decision_surface_plot(clf):
    """Plot the classifier's decision surface along with the data points."""
    # Generate the grid of co-ordinates at which we will ask the classifier
    # to predict a class
    xx, yy = np.meshgrid(np.arange(4.0, 8.5, 0.01),
                         np.arange(-0.5, 3.0, 0.01))
    # Predict classes for all grid points
    Y_pred = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    # New figure
    plt.figure(figsize=(10,8))
    # Plot the contours of the boundaries corresponding to each class
    plt.contourf(xx, yy, (Y_pred == 0).reshape(xx.shape), 1, cmap='OrRd', alpha=.4)
    plt.contourf(xx, yy, (Y_pred == 1).reshape(xx.shape), 1, cmap='Blues', alpha=.4)
    plt.contourf(xx, yy, (Y_pred == 2).reshape(xx.shape), 1, cmap='Greens', alpha=.4)
    # Plot the data points
    plt.scatter(X[Y==0,0], X[Y==0,3], color='red')
    plt.scatter(X[Y==1,0], X[Y==1,3], color='blue')
    plt.scatter(X[Y==2,0], X[Y==2,3], color='green')
    # Labels
    plt.title('Fisher\'s iris dataset')
    plt.xlabel('Sepal length (cm)')
    plt.ylabel('Petal width (cm)')
    # Show
    plt.show()
In [19]:
# Build a nearest neighbours classifier that uses the 3 nearest neighbours to make a decision
nn_clf = KNeighborsClassifier(n_neighbors=3)
# Train on the dataset
nn_clf.fit(X03, Y)
# Plot the decision surface
clf_decision_surface_plot(nn_clf)
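Returning to the question posed above, we can now ask the trained classifier to predict the species of a new, unseen flower. The measurements below are invented purely for illustration:
In [ ]:
# A hypothetical new flower: sepal length 6.0 cm, petal width 1.5 cm (made-up values)
new_flower = np.array([[6.0, 1.5]])
# Look up the predicted class index in the dataset's species names
print(iris.target_names[nn_clf.predict(new_flower)])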
In [25]:
# Linear classifier
lclf = LinearSVC()
# Train
lclf.fit(X03, Y)
# Plot the decision surface
clf_decision_surface_plot(lclf)
In [26]:
# Random forest classifier
fclf = RandomForestClassifier()
# Train
fclf.fit(X03, Y)
# Plot the decision surface
clf_decision_surface_plot(fclf)
The random forest classifier is composed of decision trees that split the space along lines aligned to the X and Y axes.
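To see these axis-aligned splits on their own, we can train a single decision tree and plot its decision surface. This is a quick sketch, not part of the classifiers above; note that every boundary it draws is a horizontal or vertical line:
In [ ]:
from sklearn.tree import DecisionTreeClassifier
# A single decision tree; all of its decision boundaries are axis-aligned
tclf = DecisionTreeClassifier()
tclf.fit(X03, Y)
clf_decision_surface_plot(tclf)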